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Abstract. We design and implement a parallel algebraic multigrid method for isotropic 
graph Laplacian problems on multicore Graphical Processing Units (GPUs). The proposed 
AMG method is based on the aggregation framework. The setup phase of the algorithm uses 
a parallel maximal independent set algorithm in forming aggregates and the resulting coarse 
level hierarchy is then used in a K-cycle iteration solve phase with a £^-Jacobi smoother. 
Numerical tests of a parallel implementation of the method for graphics processors are 
presented to demonstrate its effectiveness. 



1. Introduction 

We consider development of a multilevel iterative solver for large-scale sparse linear sys- 
tems corresponding to graph Laplacian problems for graphs with balanced vertex degrees. A 
typical example is furnished by the matrices corresponding to the (finite difference) / (finite 
volume) / (finite element) discretizations of scalar elliptic equation with mildly varying coef- 
ficients on unstructured grids. 

Multigrid (MG) methods have been shown to be very efficient iterative solvers for graph 
Laplacian problems and numerous parallel MG solvers have been developed for such systems. 
Our aim here is to design an algebraic multigrid (AMG) method for solving the graph 
Laplacian system and discuss the implementation of such methods on multi-processor parallel 
architectures, with an emphasis on implementation on Graphical Processing Units (GPUs). 

The programming environment which we use in this paper is the Compute Unified Device 
Architecture (CUDA) toolkit introduced in 2006 by NVIDIA which provides a framework 
for programming on GPUs. Using this framework in the last 5 years several variants of 
Geometric Multigrid (CMC) methods have been implemented on GPUs [T71I61 [T6l IT8l [T^ [T3] 
and a high level of parallel performance for the CMC algorithms on CUDA-enabled GPUs 
has been demonstrated in these works. 

On the other hand, designing AMG methods for massively parallel heterogenous comput- 
ing platforms, e.g., for clusters of GPUs, is very challenging mainly due to the sequential 
nature of the coarsening processes (setup phase) used in AMG methods. In most AMG 
algorithms, coarse-grid points or basis are selected sequentially using graph theoretical tools 
(such as maximal independent sets and graph partitioning algorithms). Although extensive 
research has been devoted to improving the performance of parallel coarsening algorithms, 
leading to notable improvements on CPU architectures [SI EH ETf EH IHl lU [22l EH] , on a single 
GPU OEnillH], and on multiple GPUs [12j, the setup phase is still considered a bottleneck 
in parallel AMG methods. We mention the work in where a smoothed aggregation setup 
is developed in CUDA for GPUs. 

Key words and phrases, multigrid methods, unsmoothed aggregation, adaptive aggregation. 
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In this paper, we describe a parallel AMG method based on the un-smoothed aggrega- 
tion AMG (UA-AMG) method. The setup algorithm we develop and implement has several 
notable design features. A key feature of our parallel aggregation algorithm is that it first 
chooses coarse vertices using a parallel maximal independent set algorithm [TT] and then 
forms aggregates by grouping coarse level vertices with their neighboring fine level vertices, 
which, in turn, avoids ambiguity in choosing fine level vertices to form aggregates. Such a 
design eliminates both the memory write conflicts and conforms to the CUDA programming 
model. The triple matrix product needed to compute the coarse-level matrix (a main bot- 
tleneck in parallel AMG setup algorithms) simplifies significantly in the UA-AMG setting, 
reducing to summations of entries in the matrix on the finer level. The parallel reduction 
sums available in CUDA are quite an efficient tool for this task during the AMG setup phase. 
Additionally, the UA-AMG setup typically leads to low grid and operator complexities. 

In the solve phase of the proposed algorithm, a K-cycle ^Hl HI 12] is used to accelerate 
the convergence rate of the multilevel UA-AMG method. Such multilevel method optimizes 
the coarse grid correction and results in an approximate two-level method. Two parallel 
relaxation schemes considered in our AMG implementation are a damped Jacobi smoother 
and a parameter free Jacobi smoother introduced in [25j and its weighted version in [Jj. 
To further accelerate the convergence rate of the resulting K-cycle method we apply it as a 
preconditioner to a nonlinear conjugate gradient method. 

The remainder of the paper is organized as follows. In Section |2j we review the UA-AMG 
method. Then, in Section [3} a parallel graph aggregation method is introduced, which is 
our main contribution. The parallelization of the solve phase is discussed in Section |4j In 
Section |5| we present some numerical results to demonstrate the efficiency of the parallel 
UA-AMG method. 



2. Unsmoothed Aggregation AMG 

The linear system of interest has as coefficient matrix the graph Laplacian corresponding 
to an undirected connected graph Q = (y,£). Here, V denotes the set of vertices and £ 
denotes the set of edges of Q. We set n = \V\ (cardinality of V). By (-, ■) we denote the 
inner product in £^(1R") and the superscript t denotes the adjoint with respect to this inner 
product. The graph Laplacian A : IR" H- IR" is then defined via the following bilinear form 

{Au, v) = ^ ujij{ui - Uj){vi - Vj) + ^ uf UjVj, S <ZV. 
k={i,j)&£ jes 

We assume that the weights Uij, and cjj^ are strictly positive for all i and j. The first 
summation is over the set of edges S (over k E £ connecting the vertices i and j), and Ui and 
Uj are the i-th and j-th coordinate of the vector u G IR", respectively. We also assume that 
the subset of vertices S is such that the resulting matrix A is symmetric positive definite 
(SPD). If the graph is connected S could contain only one vertex and A will be SPD. For 
matrices corresponding to the discretization scalar elliptic equation on unstructured grids S 
is the set of vertices near (one edge away from) the boundary of the computational domain. 
The linear system of interest is then 



(2.1) 



Au = f. 
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With this system of equation we associate a multilevel hierarchy which consists of spaces 
Vq C Vi C . . . G Vi = IR", each of the spaces is defined as the range of interpola- 
tion/prolongation operator Pl_-^ : ]R"'"i i— )■ V/ with Range(P/_]^) = V5_i. 

Given the l-th. level matrix Ai e M"'^"', the aggregation-based prolongation matrix Pl^i 
is defined in terms of a non-overlapping partition of the n/ unknowns at level / into the 
nonempty disjoint sets G^-, j = 1, . . . , rii^i, called aggregates. An algorithm for choosing such 
aggregates is presented in the next section. The prolongation P/_^ is the ni x n;_i matrix 
with columns defined by partitioning the constant vector, 1 = (1, . . . , 1)*, with respect to 
the aggregates: 



(2.2) {PLih = {: '[l^^' ^ = l,...,m, J 

otherwise 



ni-i. 



The resulting coarse-level matrix Ai^i G M."''-^^'^'-'^ is then defined by the so called "triple 
matrix product" , namely, 

(2.3) A-i = {PUYAiPi,). 

Note that since we consider UA-AMG, the interpolation operators are boolean matrices such 
that the entries in the coarse-grid matrix v4;_i can be obtained from a simple summation 
process: 

(2.4) (Ai_i)i, = ^ ^ a,t, i,j = 1,2,... 

seGi teG-i 

Thus, the triple matrix product, typically the costly procedure in an AMG setup, simplifies 
significantly for UA-AMG to reduction sums. 

We now introduce a general UA-AMG method (see Algorithm [T]) and in the subsequent 
sections we describe the implementation of each of the components of Algorithm [T] for GPUs. 

3. The Setup Phase 

Consider the system of linear equations ( |2.1[ ) corresponding to an unweighted graph Q = 
{V,^^} partitioned into two subgraphs Qk = {Vfc,£^fe},fc = 1,2. Further assume that the 
two subgraphs are stored on separate computes. To implement a Jacobi or Gauss-Seidel 
smoother for the graph Laplacian equation with respect to Q, the communication between 
the two computers is proportional to the number of edge cuts of such a partitioning, given 
by 

\s\{s,uS2)\. 

Therefore, a partition corresponding to the minimal edge cut in the graph results in the 
fastest implementation of such smoothers. This in turn gives a heuristic argument, as also 
suggested in [21], [23], that when partitioning the graph in subgraphs (aggregates) the sub- 
graphs should have a similar number of vertices and have a small "perimeter." Such a 
partitioning can be constructed by choosing any vertex in the graph, naming it as a coarse 
vertex, and then aggregating it with its neighboring vertices. This heuristic motivates our 
aggregation method. The algorithm consists of a sequence of two subroutines: first, a paral- 
lel maximal independent set algorithm is applied to identify coarse vertices; then a parallel 
graph aggregation algorithm follows, so that subgraphs (aggregates) centered at the coarse 
vertices are formed. 
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Algorithm 1 UA-AMG 



Setup Phase: 

1: Given uq (size of the coarsest level) and L (maximum levels) 
2: I ^ L, 

3: while A/ > no & / > do 
4: Construct the aggregation A/"/, i = 1,2, 
5: Compute An by ( p^ , 
6: I ^l-l, 
7: end while 

Solve Phase: 



Ai+i based on Ai 



if (On the coarsest level) then 

solve AiUi = fi exactly, 
else 

Pre-smoothing: Ui smooth(M/, A;, /;) 
Restriction: compute ri_i = (-P/_i)"^(/i 
Coarse grid correction: solve Ai_iei_i = ri 
AMG on coarser level / — 1 and get e/_i. 
Prolongation: compute Ui Ui + P/_]^ei_i- 
Post-smoothing: ui <— smooth{ui, Ai, fi). 
end if 



AiUi), 

1 approximately by recursively calling the 



In the algorithm, to reduce repeated global memory read access and write conflicts, we 
impose explicit manual scheduling on data caching and flow control in the implementations 
of both algorithms; the aim is to achieve the following goals: 

(1) (Read access coalescence): To store the data that a node uses frequently locally or 
on a fast connecting neighboring node. 

(2) (Write conflicting avoidance): To reduce, or eliminate the situation that several nodes 
need to communicate with a center node simultaneously. 

3.1. A Maximal Independent Set Algorithm. The idea behind such algorithm is to 
simplify the memory coalescence, and design a random aggregation algorithm where there 
are as many as possible threads loading from a same memory location, while as few as possible 
threads writing to a same memory location. Therefore, it is natural to have one vertex per 
thread when choosing the coarse vertices. For vertices that are connected the corresponding 
processing threads should be wrapped together in a group. By doing so, repeated memory 
loads from the global memory can be avoided. 

However, we also need to ensure that no two coarse vertices compete for a flne level point, 
because either atomic operations as well as inter-thread communication is costly on a GPU. 
Therefore, the coarse vertices are chosen in a way that any two of them are of distance 3 
or more, which is the same as flnding a maximal independent set of vertices for the graph 
corresponding to A^, where A is the graph Laplacian of a given graph Q, so that each flne 
level vertex can be determined independently which coarse vertex it associates with. 

Given an undirected unweighted graph Q = {V,S}, we flrst flnd a set C of coarse vertices 
such that 



(3.1) 



PARALLEL UNSMOOTHED AGGREGATION ALGEBRAIC MULTIGRID ALGORITHMS ON GPUS 5 



Here, d{-, ■) is the graph distance function defined recursively as 



0, i = j; 

d{i,j) = '{ min d(k,j) + l, i^j. 

Assume we obtain such set C, or even a subset of C, we can then form aggregates, by 
picking up a vertex i in C and defining an aggregate as a set containing i and its neighbors. 



The condition (3.1) guarantees that two distinct vertices in C, do not share any neighbors. 



The operation of marking the numberings of subgraphs on the fine grid vertices is write 



confiict free, and the restriction imposed by (3.1) ensures that aggregates can be formed 
independently, and simultaneously. 

The rationale of the independent set algorithm is as follows: First, a random vector v is 
generated, each component of which corresponds to a vertex in the graph. Then we define 
the set C as the following: 

C = {i \vi> Vj.Mj : < d{i,j) < 3}. 

If C is not empty, then such construction results in a collection of vertices in C is of distance 
3 or more. Indeed, assume that d{i,j) < 3 for i, j G C, let Vi > vj. From the definition 
of the set C, we immediately conclude that i ^ C. Of course, more caution is needed 
when C defined above is empty (a situation that may occur depending on the vector v). 
However, this can be remedied, by assuming that the vector v (with random entries) has a 
global maximum, which is also a local maximum. The C contains at least this vertex. The 
same algorithm can be applied then recursively to the remaining graph (after this vertex is 
removed). In practice, C does not contain one but more vertices. 



3.2. Parallel Graph Aggregation Algorithm. We here give a description of the parallel 
aggregation algorithm, running the exact copies of the code on each thread. 

Within each pass of the Parallel Aggregation Algorithm (PAA, Algorithm [2]), the following 
two steps are applied to each vertex i. 

(A) Construct a set C which contains coarse vertices. 

(B) Construct an aggregate for each vertex in C. 

Note that these two subroutines can be executed in a parallel fashion. Indeed, step (A) does 
not need to be applied to the whole graph before starting step (B). Even if C is partially 
completed, any operation in step (B) will not interfere step (A), running on the neighboring 
vertices and completing the construction of C. A problem for this approach is that it usually 
cannot give a set of aggregates that cover the vertex set V after 1 pass of step (A) and step 
(B). We thus run several passes and the algorithm terminates when a complete cover is 
obtained. The number of passes is reduced if we make the set C as large as possible in each 
pass, therefore the quasi-random vector v needs to have a lot of local maximums. Another 
heuristic argument is that C needs to be constructed in a way that every coarse vertex has 
a large number of neighboring vertices. Numerical experiments suggest that the following is 
a good way of generating the vector v with the desired properties. 

(3.2) Vi i — quasi_random(i) := di + ((i mod 12) + rand()) /12. 

where d^ is the degree of the vertex i, and rand() generates a random number uniformly 
distributed on the interval [0, 1]. 
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Algorithm 2 Parallel Aggregation Algorithm (PAA) 



(1) Generate a quasi-random number and store it in Vi, as 

Vi i — quasi_random(i); 

mark vertex i as "unprocessed" ; wait until all threads complete these operations. 

(2) (2a) Goto (2d) if i is marked "processed", otherwise continue to (2b). 

(2b) Determine if the vertex i is a coarse vertex, by check if the following is true. 

Vi > Vj, Vj : {A'^)ij ^ and j is unprocessed . 

If so, continue to (2c); if not, goto (2d). 
(2c) Form an aggregate centered at i. Let Si be a set of vertices defined as 

= {j I Vi > : {A'^)ij 7^ and j is unprocessed }. 

Define a column vector w such that 

1, A) G Si] 
0, k^S,. 

Mark vertices j G Si "processed" and request an atomic operation to update the 
prolongator P as 

P ^ [P, w] . 

(2d) Synchronize all threads (meaning: wait until all threads reach this step). 
(2e) Stop if i is marked "processed", otherwise goto step (2a). 



3.3. Aggregation Quality Improvements. To improve the quality of the aggregates, we 
can either impose some constrains during the aggregation procedure (which we call in-line 
optimization), or introduce a post-process an existing aggregation in order to improve it. One 
in-line strategy that we use to improve the quality of the aggregation is to limit the number 
of vertices in an aggregate during the aggregation procedure. However, such limitations may 
result in a small coarsening ratio. In such case, and numerical results suggest that applying 
aggregation process twice, which is equivalent to in skipping a level in a multilevel hierarchy, 
can compensate that. 

Our focus is on a post-processing strategy, which we name "rank one optimization". It 
uses an a priori estimate to adjust the interface (boundary) of a pair of aggregates, so that 
the aggregation based two level method, with a fixed smoother, converges fast locally on 
those two aggregates. 

We consider the connected graph formed by a union of aggregates (say a pair of them, 
which will be the case of interest later), and let n be the dimension of the underlying vector 
space. Let A : IR" i-^ IR" be a semidefinite weighted graph Laplacian (representing a local 
sub-problem) and i? be a given local smoother. As is usual for semidefinite graph Laplacians, 
we consider the subspace ^^-orthogonal to the null space of A and we denote it by V. The £^ 
orthogonal projection on V is denoted here by Uy- Let S — I — RA be the error propagation 
operator for the smoother R. We consider the two level method whose error propagation 
matrix is 

E{V,) = E{V,; S) = il- QxiV,))iI - RA). 
Here V"c C is a subspace and Q-^{Vc) is the A-orthogonal projection of the elements of 
V onto the coarse space Vc- In what follows we use the notation E(Vc; S) when we want 
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to emphasize the dependence on S. We note that Q^iVc) is well defined on V because 
A is SPD on V and hence it {A-, ■) is an inner product on V. We also have that Q^{Vc) 
self-adjoint on V and under the assumption Vc C V, we obtain Q^{Vc) = HyQ^iVc) and 
^vQa(^c) = Qa(^c)^v- Also, Sy = HyS is self-adjoint on V in the {A-, ■) inner product iff 
R is self-adjoint in the £^-inner product on H". 

We now introduce the operator TiVc) (recall that Vc C V) 

T(K) = nVc, S) = S-E{Vc) = Qx{Vc){I - RA) = Qx{Vc)S. 
and from the definition of for all f G we have 

(3.3) \E{VMa = \^vE{VM\ = \^vSv\\ - \UvT{Vc)v\a = \Svv\\ - \T{Vc)v\a- 

We note the following identities which follow directly from the definitions above and the 
assumption Vc C V: 

(3.4) \EiVc; S)\^ = lUvEiV; Sv)\x, |T(K; S)\^ = \UvT{Vc; Sy)\;^. 

The relation ( |3.3 ) suggests that, in order to minimize the seminorm |i?(V"c)f |^ with respect 
to the coarse space Vc, we need to make |T(V^)|^ maximal. The following lemma quantifies 
this observation and is instrumental in showing how to optimize locally the convergence rate 
when the subspaces Vc are one dimensional. In the statement of the lemma we use argmin 
to denote a subset of minimizers of a given, not necessarily linear, functional F{x) on a space 
X. More precisely, we set 

y G argmin F(x), if and only if, F{y) = minF(a;). 
We have similar definition (with obvious changes) for the set argmaxF(x). 

Lemma 3.1. Let Sy = ^vS, be the projection of the local smoother on V , and Vc be the set 
of all one dimensional subspaces ofV. Then we have the following: 

(3.5) |5y|2= max|T(V;)U, 

(3.6) // Wc G arg max |T(K) | a> t^en Wc G arg min \E{Vc) \ ^, 

Vc £ Vc £ Vc 

where E{Vc) = (/ - QxiVc))S and T{Vc) = gi(K.)5. 

Proof. From the identities ( |3.4 ) it follows that we can restrict our considerations on C 
IR" and that we only need to prove the Lemma with E{Vc) = IlvE{Vc; Sy) and T{Vc) = 
IlvT{Vc; Sv)- In order to make the presentation more transparent, we denote | ■ | = | ■ 
n = Q^. Let us mention also that by orthogonality in this proof we mean orthogonality in 
the {A-, ■) inner product on V. The proof then proceeds as follows. 

Let (f E V he such that ISyy^l = \Sv\\f\- We set Wc = spanlSyip}. Note that for such 
choice of Wc we have Il(Wc)Sv'^ = Sy^ and hence 

\M = = < |r(wy|. 

On the other hand, for all G Vc we have |n(Vc)| = 1 and we then conclude that 

(3.7) \T{Vc)\ = \Ii{Vc)Sv\ < mVc)\\Sv\ = \Sv\ < \T{Wc)\. 
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By taking a maximum on Vc in (3.7), we conclude the following thus prove (3.5). 

\T{W,)\ < max |T(K)| < l^v^l < |T(iy,)|. 



To prove (3.6), we observe that for any Wc G arg max |T(Vc)| the inequalities in (3.7) 
become equalities and hence 

\^{W,)Sv\ = \Sv\ = \SvIl{W,)\. 
\S w\ 

This implies that \Sv\ = max — — — . It is also clear that \Svw\ = \Sv\\w\ for all w G Wc, 

w£Wc \w\ 

because Wc is one dimensional. In addition, since Sy is self-adjoint, it follows that Wc is the 
span of the eigenvector of Sy with eigenvalue of magnitude 15*^1. Next, for any Vc G Vc we 
have 

\E{Vc)\ = \{I- Il{Vc))Sv\ = \Sv{I - n(K))| = max 

vev^^ \v\ 

By the mini-max principle (see (TUl pp. 31-35] or [20]) we have that |i?(\4)| ^ 0^2, where 
(72 is the second largest singular value of Sy and with equality holding iff Vc = Wc. This 
completes the proof. ■ 
We now move on to consider a pair of aggregates. Let A be the graph Laplacian of a 
connected positively weighted graph Q which is union of two aggregates Vi and V2. Further- 
more, let Ivi be the characteristic vector for Vi, namely a vector with components equal 
to 1 at the vertices of Vi and equal to zero at the vertices of V2. Analogously we have a 
characteristic vector ly^ for V2. Finally, let K;(Vi, V2) be the space of vectors that are linear 
combinations of Ivi and ly^- More specifically, the subspace Vc is defined as 

VciVi, V2) = span I ( p^lvi - ^^Iv, 

Let Vc be the set of subspaces defined above for all possible pairs of Vi and V2, such that 
Q = ViLS V2. Note that by the definition above, every pair (Vi, V2) gives us a space Vc G Vc 
which is orthogonal to the null space of A, i.e. orthogonal to 1 = + lv2- 

We now apply the result of Lemma 3.1| and show how to improve locally the quality of 
the partition (the convergence rate |i?(V"c)|^) by reducing the problem of minimizing the 
A-norm of -E(V^) to the problem of finding the maximum of the A-norm of the rank one 
transformation T{Vc). Under the assumption that the spac es V c are orthogonal to the null 



space of A (which they satisfy by construction) from Lemma 3.1 we conclude that the spaces 
Wc which minimize |i?(V^)|^ also maximize |T(V^)|^. 

For the pair of aggregates, |T(Vc)|^ is the largest eigenvalue of S^AQ^{Vc)SA^, where A^ 
is the pseudo inverse of A. Clearly, the matrix S'^ AQ ^{Vc) S A^ , is also a rank one matrix 
and hence 

\T{Vc)\x = tT{S^AQ;i{Vc)SA^). 

During optimization steps, we calculate the trace using the fact that for any rank one 
matrix W we have 

(3.8) triW) ^ ^ ^ 

Wkk elWkCk 
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where Wkk is a nonzero diagonal entry (any nonzero diagonal entry) and Wk is the k-th 
column of W. The formula (3.8) is straightforward to prove if we set W = uv^ for two 
column vectors u and v, and also suggests a numerical algorithm. We devise a loop computing 
Wk = Wck, and W^k = e^WkCk, for = 1, . . . , m, where m is the dimension of W; The 
loop is terminated whenever Wkk 7^ 0, and we compute the trace via (3.8) for this k. In 
particular for the examples we have tested, W = S'^ AQ ^{Vc) S is usually a full matrix 
and we observed that the loop almost always terminated when k = 1. 

The algorithm which traverses all pairs of neighboring aggregates and optimizes their 
shape is as follows. 



Algorithm 3 Subgraph Reshaping Algorithm 

Input: Two set of vertices, Vi and V2, corresponding to a pair of neighboring sub- 
graphs. Output: Two sets of vertices, Vi and V2 satisfying that 

ViUV2 = ViUV2, and | |Vi| - IV2II < 1, 

and the subgraphs corresponding to Vi and V2 are both connected. 

(1) Let = |Vi| + IV2I, then compute m = [n/2\. 

(2) Run in parallel to generate all partitionings such that the vertices set 

ViUV2 = ViUV2, |Vi|=m, 

and the subgraphs derived by Vi and V2 are connected. 

(3) Run in parallel to compute the norm |T(V^)|^ for all partitionings get from step (2), 
and return the partitioning that results in maximal |T(VJ,)|^. 



The subgraph reshaping algorithm fits well the programming model of a multicore GPU. 
We demonstrate this algorithm on two example problems, and later show its potential as a 
post-process for the parallel aggregation algorithm (Algorithm [2]) outlined in the previous 
section. In the examples that follow next we use the rank one optimization and then measure 
the quality of the coarse space also by computing the energy norm of the \Q\^, where Q is 
the £^-orthogonal projection to the space Wc- 

Example 3.2. .■ Consider a graph Laplacian A corresponding to a graph which is a A x 
4 square grid. The weights on the edges are all equal to 1. We start with an obviously 
non- optimal partitioning as shown on the left of Figure [I[ of which the resulting two level 
method, consisting of i^-Jacobi pre- and post-smoothers and an exact coarse level solver, has 
a convergence rate \E\^ = 0.84, and = 1.89. After applying Algorithm^, the refined 
aggregates have the shapes shown on the right of Figure [I| of which the two level method has 
the same convergence rate \E\^ = 0.84 but the square of the energy seminorm is reduced to 
\Q\\=1.50. 

Example 3.3. Consider a graph Laplacian A corresponding to a graph which is a 4 x 4 
square grid, on which all horizontal edges are weighted 1 while all vertical edges are weighted 
10. Such graph Laplacian represents anisotropic coefficient elliptic equations with Neumann 
boundary conditions. Start with a non-optimal partitioning as shown on the left of Figure^ 
of which the resulting two level method has a convergence rate \E\-^ = 0.96 and |<5|^ = 4.88. 
After applying Algorithm \^ the refined aggregates have the shapes shown on the right of 
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Figure 1. Subgraph reshaping algorithm apphed on a graph repre- 
senting an isotropic coefficient eUiptic PDE. 



Figure^ of which the two level convergence rate is reduced to \E\^ = 0.90 and the energy of 
the coarse level projection is also reduced as = 1-50. 



Figure 2. Subgraph reshaping algorithm applied on a graph repre- 
senting an anisotropic coefficient elliptic PDE. 



4. Solve Phase 

In this section, we discuss the parallelization of the solver phase on GPU. More precisely, we 
will focus on the parallel smoother, prolongation/restriction, MG cycle, and sparse matrix- 
vector multiplication. 

4.1. Parallel Smoother. An efficient parallel smoother is crucial for the parallel AMG 
method. For the sequential AMG method, Gauss-Seidel relaxation is widely used and has 
been shown to have a good smoothing property. However the standard Gauss-Seidel is a 
sequential procedure that does not allow efficient parallel implementation. To improve the 
arithmetic intensity of the smoother and make it work better with SIMT based GPUs, we 
adopt the well-known Jacobi relaxation, and introducing a damping factor to improve the 
performance of the Jacobi smoother. For a matrix A e M*^^" and its diagonals are denoted 
hj D = diag(aii, 022, ■" " ? c^nn), the Jacobi smoother can be written in the following matrix 
form 

3,m+i = ^m^ ouD-^r"", where r"^ = 6 - Ax"", 



or component-wise 



xr' = xT + ua^rf. 



This procedure can be implemented efficiently on GPUs by assigning one thread to each 
component, and update the corresponding components locally and simultaneously. 
We also consider the so-called Jacobi smoother, which is parameter free. Define 

M = diag(Mn,M22,--- ,M„„), 

where Ma = an + da with da = Xl^yi I'^yl' ^^"^ the Jacobi has the following matrix form 

_ ^ M-V™, where r"^ = 6 - Ax™, 
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or component-wise 

In [25| [7] it has been show that if A is symmetric positive definite, the smoother is always 
convergent and has multigrid smoothing properties comparable to full Gauss-Seidel smoother 
if a-ii > (^dii and 6 is bounded away from zero. Moreover, because its formula is very similar to 
the Jacobi smoother, it can also be implemented efficiently on GPUs by assigning one thread 
to each component, and update the corresponding the component locally and simultaneously. 

4.2. Prolongation and Restriction. For UA-AMG method, the prolongation and restric- 
tion matrices are piecewise constant and characterize the aggregates. Therefore, we can 
preform the prolongation and restriction efficiently in UA-AMG method. Here, the output 
array aggregation (column index of P), which contains the information of aggregates, plays 
an important rule. 

• Prolongation: Let that the action = -P/_if' ^ can be written 
component-wise as follows: 

Assign each thread to one element of f', and the array aggregation can be used to 
obtain information about j E G-"^, i.e., i = aggregation[j], so that prolongation 
can be efficiently implemented in parallel. 

• Restriction: Let v'' e M"', so that the action (P/_^)"^f' can be written component- 
wise as follows: 

Therefore, each thread is assigned to an element of and the array aggregation 
can be used to obtain information about j G Gi~^ , i.e., to find all j such that 
aggregation[j] = i. By doing so, the action of restriction can also be implemented 
in parallel. 

4.3. K-cycle. Unfortunately, in general, UA-AMG with V-cycle is not an optimal algorithm 
in terms of convergence rate. But on the other hand, in many cases, UA-AMG using two-grid 
solver phase gives optimal convergence rate for graph Laplacian problems. This motivated us 
to use other cycles instead of V-cycle to mimic the two-grid algorithm. The idea is to invest 
more works on the coarse grid, and make the method become closer to an exact two-level 
method, then hopefully, the resulting cycle will have optimal convergence rate. 

The particular cycle we will discuss here is the so-called K-cycle (Nonlinear AMLI-cycle) 
and we refer to [29l [H [2] for details on its implementation in general. 

4.4. Sparse Matrix- Vector Multiplication on GPUs. As the K-cycle will be used as 
a preconditioner for Nonlinear Preconditioned Conjugate Gradient (NPCG) method, the 
sparse matrix-vector multiplication (SpMV) has major contribution to the computational 
work involved. An efficient SpMV algorithm on GPU requires a suitable sparse matrix 
storage format. How different storage formats perform in SpMV is extensively studied in 
[1]. This study shows that the need for coalesce accessing of the memory makes ELLPACK 
(ELL) format one of the most efficient sparse matrix storage formats on GPUs when each 
row of the sparse matrix has roughly the same nonzeros. In our study, because our main 
focus is on the parallel aggregation algorithm and the performance of the UA-AMG method, 
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we still use the compressed row storage (CSR) format, which has been widely used for the 
iterative linear solvers on CPU. Although this is not an ideal choice for GPU implementation, 
the numerical results in the next section already show the efficiency of our parallel AMG 
method. 



5. Numerical Tests 

In this section, we present numerical tests using the proposed parallel AMG methods. 
Whenever possible we compare the results with the CUSP libraries p2]. CUSP is an open 
source C++ library of generic parallel algorithms for sparse linear algebra and graph com- 
putations on CUDA-enabled CPUs. All CUSP's algorithms and implementations have been 
optimized for GPU by NVIDIA's research group. To the best of our knowledge, the parallel 
AMG method implemented in the CUSP package is the state-of-the-art AMG method on 
GPU. We use as test problems several discretizations of the Laplace equation. 



5.1. Numerical Tests for Parallel Aggregation Algorithm. Define Q, the £^ projection 
on the piece-wise constant space Range(P), as the following: 

Q = P{P'^P)-^P^. 

We present several tests showing how the energy norm of this projection changes with respect 
to different parameters used in the parallel aggregation algorithm, since the convergence rate 
is an increasing function of ||Q||a- 

The tests involving \\Q\\a further suggest two additional features necessary to get a multi- 
grid hierarchy with predictable results. First, the sizes of aggregates need to be limited, and 
second, the columns of the prolongator P need to be ordered in a deterministic way, regard- 
less of the order that aggregates are formed. The first requirement can be fulfilled simply by 
limiting the sizes of the aggregates in each pass of the parallel aggregation algorithm. We 
make the second requirement more specific. Let Ck to be the index of the coarse vertex of 
the k-th. aggregate. We require that Ck should be an increasing sequence and then use the 
/c-th column of P to record the aggregate with the coarse vertex numbered c^. This can be 
done by using a generalized version of the prefix sum algorithm [3]. 



We first show in Table 5.1 the coarsening ratios (in the parenthesis in the table) and the 
energy norms of a two grid hierarchy, for a Laplace equation with Dirichlet boundary 

conditions on a structured grid containing vertices. The limit on the size of an aggregate 
is denoted by t, which suggests that, any aggregate can include t vertices or less, which 
directly implies that the resulting coarsening ratio is less or equal to t. 





t=2 t=3 t=4 t=5 


n = 128 
n = 256 
n = 512 


(1.99) 1.71 (2.03) 2.04 (2.41) 2.46 (2.97) 3.15 
(1.99) 1.72 (2.39) 2.57 (2.96) 2.59 (2.99) 3.20 
(2.00) 1.72 (2.01) 2.08 (2.40) 2.48 (2.99) 3.22 



Table 5.1. (coarsening ratios) and energy norm of a two grid hierar- 
chy of a Laplace equation on a uniform grid with Dirichlet boundary 
conditions. 



PARALLEL UNSMOOTHED AGGREGATION ALGEBRAIC MULTIGRID ALGORITHMS ON GPUS 13 



For the same aggregations on the graphs that represent Laplace equations with Neumann 
boundary conditions, the corresponding coarsening ratio (in parenthesis) and \QW seminorms 



with respect to grid size n and hmiting threshold t are shown in Table 5.2 



t 



t 



t 



t 



n = 128 (1.99) 1.87 (2.03) 2.11 (2.41) 2.48 (2.97) 3.24 
n = 256 (1.99) 1.74 (2.39) 2.59 (2.96) 2.62 (2.99) 3.24 
n = 512 (2.00) 1.87 (2.01) 2.11 (2.40) 2.49 (2.99) 3.24 



Table 5.2. (coarsening ratios) and energy norm of a two grid hierar- 
chy of a Laplace equation on a uniform grid with Neumann boundary 
conditions. 



In Table |5.3| we present the computed bounds on the coarsening ratio and energy of a 
two level hierarchyBy energy of a two level hierarchy here, we mean the semi-norm of the 
£^ projection on the coarse space, when the fine level is an x n. Such results are valid 
for any structured grid with n"^ vertices (not just n = 126, . . . , 132) This is seen as follows: 



(1) From equation (3.2), it follows that if we consider two grids of sizes rii x rii and n2 x n2 



respectively, and such that 

{rii — n2) = mod 12, or {rii + ^2) = mod 12 

then our aggregation algorithm results in the same pattern of C points on these two grids; 
(2) As a consequence grids of size n x n for n = 126 = 6 mod 12 to n = 132 = mod 12 
give all possible coarsening patterns that can be obtained by our aggregation algorithm on 
any 2D tensor product grid. As a conclusion, the values of the coarsening ratios and the 



energy semi-norm given in Table 5.3 are valid for any 2D structured grid 



t 



t 



t = 4 



t 



n = 126 (2.00) 2.11 (2.00) 2.07 (2.36) 2.73 (2.36) 2.73 
n = 127 (1.99) 1.86 (2.01) 1.98 (2.01) 2.49 (2.01) 2.34 
n = 128 (1.99) 1.71 (2.03) 2.04 (2.41) 2.47 (2.97) 3.15 
n = 129 (1.99) 1.84 (2.02) 2.04 (2.03) 2.31 (2.02) 2.42 
n = 130 (1.99) 1.77 (2.40) 2.21 (2.92) 2.86 (2.94) 2.94 
n = 131 (1.99) 2.61 (2.01) 2.41 (2.01) 2.45 (2.00) 2.49 
n = 132 I (1.98) 2.09 (2.21) 2.81 (2.33) 2.89 (2.26) 2.94 
Table 5.3. (coarsening ratios) and energy norm of a two grid hierar- 
chy of a Laplace equation on a uniform grid with Neumann boundary 
conditions. 



We also apply this aggregation method on graphs corresponding to Laplace equations 
on 2 dimensional unstructured grids with Dirichlet or Neumann boundary conditions. The 
unstructured grids are constructed by perturbing nodes in an n x n square lattice (n = 
128, 256, 512), followed by triangulating the set of perturbed points using a Delaunay trian- 
gulation. The condition numbers of the Laplacians, derived using finite element discretization 
of the Laplace equations on the mentioned unstructured grids with Dirichlet boundary con- 
ditions, are about 1.2 x lO'', 5.0 x 10^, and 2.1 x 10^ respectively. The coarsening ratios and 
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\QW are listed in table 5.4 and 5.5 We remark here that we also apply the parallel aggre- 



gation algorithm without imposing limit on the size of an aggregate, and the corresponding 
numerical results are listed in columns named "t = oo" . 



t 



t = 4 



t = 5 



t = oo 



n = 128 
n = 256 
n = 512 



;i.80) 2.39 
;i.79) 2.39 
;i.80) 2.38 



(2.53) 2.44 
(2.52) 2.60 
(2.55) 2.67 



(3.17) 3.10 
(3.15) 3.18 
(3.19) 3.26 



(3.73) 3.46 
(3.69) 3.46 
(3.72) 3.56 



(4.91) 3.30 
(4.91) 3.41 
(4.93) 3.40 



Table 5.4. (coarsening ratios) and energy norm of a two grid hierarchy of 
a Laplace equation with Dirichlet boundary conditions discretized on an un- 
structured grid. 





t = 2 t = 3 t = A t = 5 t = oo 


n = 128 
n = 256 
n = 512 


(1.80) 2.39 (2.53) 2.54 (3.17) 3.18 (3.73) 3.48 (4.91) 3.33 

(1.79) 2.49 (2.52) 2.65 (3.15) 3.20 (3.69) 3.48 (4.91) 3.41 

(1.80) 2.47 (2.55) 2.80 (3.19) 3.27 (3.72) 3.57 (4.93) 3.53 



Table 5.5. (coarsening ratios) and energy norm of a two grid hierarchy of 
a Laplace equation with Neumann boundary conditions discretized on an un- 
structured grid. 



We note that in Table |5.4| and Table |5.5[ the coarsening ratios are not large enough to 
result in small operator complexity. We then estimate the energy norm \QW when Q is the £^ 
orthogonal projection from any level of the multigrid hierarchy to any succeeding sub-levels. 
We start with a Laplace equation on a 128^ structured square grid, set the limit of sizes of 
aggregates as t = 5 on each iteration of aggregation, and stop when the coarsest level is of 
less than 100 degrees of freedom. If denoting the finest level by level 0, and the coarsest 
level we get is level 5. The coarsening ratios and energy norm \QW between any two levels 
are shown in table 15.61 and 15.71 








1 


2 


3 4 


1 


(2.97) 3.15 








2 


(11.3) 7.34 


(3.81) 3.09 






3 


(38.6) 15.3 


(13.0) 6.74 


(3.41) 2.68 




4 


(113.8) 31.9 


(38.3) 13.9 


(10.1) 5.25 


(2.95) 3.91 


5 


(321.3) 54.5 


(108.2) 22.5 


(28.4) 9.09 


(8.33) 4.53 (2.82) 4.77 



Table 5.6. (coarsening ratios) and energy norms squares of a multigrid hier- 
archy of a Laplace equation with Dirichlet boundary conditions discretized on 
a 128^ grid. 



We observe that, on the diagonal of the table 5.6 and 5.7 the energy norms are comparable 
to the coarsening ratios, until the last level where the grid becomes highly unstructured. This 
suggests that a linear or nonlinear AMLI solving cycle can give both a good convergence 
rate and a favorable complexity. It is also observed that, on the lower triangular part of the 
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1 


2 


3 


4 


1 


(2.97) 3.23 










2 


(11.3) 7.68 


(3.81) 3.28 








3 


(38.6) 16.9 


(13.0) 7.99 


(3.41) 2.94 






4 


(113.8) 42.5 


(38.3) 20.6 


(10.1) 7.07 


(2.95) 4.23 




5 


(321.3) 98.6 


(108.2) 48.3 


(28.4) 17.0 


(8.33) 7.11 


(2.82) 5.75 



Table 5.7. (coarsening ratios) and energy norm of a multigrid hierarchy of 
a Laplace equation with Neumann boundary conditions discretized on a 128^ 
grid. 



table 5^ 5/7, the energy norms are always smaller than the corresponding coarsening ratios, 
which motives us to design in the future flexible cycles that detect and skip unnecessary 
levels. 

even if we set a limit t = 5 for the max- 



Another inspiring observation is that, in Table 5.1 



imal number of vertices in an aggregate, the resulting aggregates have an average number 
of vertices ranging from 2.97 to 2.99. We plot the aggregates of an unweighted graph corre- 
sponding to a 16 X 16 square grid on the left of Figure |3j and observe that, some aggregates 
contain 5 vertices and some contains only 1. We then use the rank one optimization discussed 
in Section 3.3 and apply subgraph reshaping algorithm (Algorithm |3]) as a post-process of 
the GPU parallel aggregation algorithm (Algorithm [2]) , and plot the resulting aggregates on 
the right of Figure |3j Since the subgraph reshaping algorithm does not change the number 
of aggregates, the coarsening ratios on the left and right of Figure [3] are identical and are 
equal to 2.72. The energy of the i"^ projection is deceased from = 2.51 (left of Figure 
11 to |g|J = 2.19 (right of Figure 
\E\^ = 0.67 (left of Figure g to \E 



3|). However, two level convergence rate increases from 
= 0.69 (right of Figure |3|. 



Figure 3. Before (left) and after (right) the Subgraph Reshaping 
Algorithm applied on partitioning given by Parallel Aggregation Al- 
gorithm. 

Some more comments on the reshaping algorithm are in order. For isotropic problems, the 
reshaping does not have significant impact of on the convergence rate because aggregation 
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obtained by standard approach already results in a good convergence rate. However, for 
anisotropic problems reshaping improves the convergence rate. In this case, starting with 
aggregates of arbitrary shape, the reshaping procedure results in aggregates aligned with the 
anisotropy and definitely improves the overall convergence rate. 

In addition, even for isotropic case, the numerical results in the manuscript indicate that 
subgraph reshaping can be essential for variety of cycling algorithms when aggressive coars- 



ening is applied. As shown in Example |3.2| and Example |3.3[ the aggregation reshaping helps 
for some isotropic and anisotropic problems when coarsening ratio is 8. In Table 5.6 and 



Table |5.7[ we observe that such coarsening ratio can be achieved by skipping every other 
level in our current multilevel hierarchy. 

Clearly, further investigation about the reshaping is needed for more general problems 
that have both anisotropic and isotropic regions. Analyzing such cases as well as testing how 
much improvement in the convergence can be achieved by subgraph reshaping for specific 
coarsening and cycling strategies are subject of an ongoing and future research. 

5.2. Numerical Tests for GPU Implementation. In this section, we perform numerical 
experiments to demonstrate the efficiency of our proposed AMG method and discuss the 
specifics related to the use of CPUs as main platform for computations. We test the parallel 
algorithm on Laplace equation discretized on quasi-uniform grids in 2D. Our test and com- 
parison platform is the NVIDIA Tesla C2070 together with a Dell computing workstation. 



Details in regard to the machine are given in Table 5.8 



CPU Type 


Intel 


CPU Clock 


2.4 GHz 


Host Memory 


16 GB 


GPU Type 


NVIDIA Tesla C2070 


GPU Clock 


575MHz 


Device Memory 


6 GB 


CUDA Capability 


2.0 


Operating System 


RedHat 


CUDA Driver 


CUDA 4.1 


Host Compiler 


gcc 4.1 


Device Compiler 


nvcc 4.1 


CUSP 


vO.3.0 



Table 5.8. Test Platform 



Because our aim is to demonstrate the improvement of our algorithm on CPUs, we con- 
centrate on comparing the method we describe here with the parallel smoothed aggregation 
AMG method implemented in the CUSP package [15]. 

We consider the standard linear finite element method for the Laplace equation on un- 



structured meshes. The results are shown in the Table 5.9 Here, CUSP uses smoothed 



aggregation AMG method with V-cycles, and our method is UA-AMG with K-cycles. The 
stoping criterion is that the i"^ norm of the relative residual is less than 10^^. According to 
the results, we can see that our parallel UA-AMG method converges uniformly with respect 
to the problem size. This is due the improved aggregation algorithm constructed by our 
parallel aggregation method and the K-cycle used in the solver phase. We can see that our 
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^DoF = 1 million 


^DoF = 4 million 




# Iter. 


Setup 


Solve 


Total 


# Iter. 


Setup 


Solve 


Total 


CUSP 


36 


0.63 


0.35 


0.98 


41 


2.38 


1.60 


3.98 


New 


19 


0.13 


0.47 


0.60 


19 


0.62 


2.01 


2.63 



Table 5.9. Comparison between the parallel AMG method in 
CUSP package (smoothed aggregation AMG with V-cycles) and 
our new parallel AMG method (UA-AMG with K-cycles). 



method is about 3 to 4 times faster in setup phase, which demonstrate the efficiency of our 
parallel aggregation algorithm. In the solver phase, due to the factor that we use K-cycle, 
which does much more work on the coarse grids, our solver phase is a little bit slower than 
the solver phase implemented in CUSP. However, the use of a K-cycle yields a uniformly 
convergent UA-AMG method, which is an essential property for designing scalable solvers. 
When the size of the problem gets larger, we expect the computational time of our AMG 
method to scale linearly whereas the AMG method in CUSP seems to grows faster than 
linear and will be slower than our solver phase eventually. Overall, our new AMG solver is 
about 1.5 times faster than the smoothed aggregation AMG method in CUSP in terms of 
total computational time, and the numerical tests suggests that it converges uniformly for 
the Poisson problem. 
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